The data set used for this project was collected from 422 adult patients who underwent liver transplant surgery between 2009 and 2019 at Vanderbilt University Medical Center. The original data set contains 140 variables, including demographic information, pre-transplant and post-transplant comorbidities, surgical variables, and outcome information such as days in various statuses, infections, and mortality. Additionally, pre-transplant and post-transplant follow-up CT scans were taken and analyzed for measures of body composition. Some patients have follow-up CT scans for up to 10 years, but many have only a few years of follow-up. It is important to note that the dates of liver transplants span from 01/09/2009 to 12/23/2018, while follow-up data continued to be collected until summer 2021.
# Loading the Data
data <- box_read("942618589040")
Currently each observation is a unique liver transplant patient. To model state transition probabilities for each patient over time, the data needs to be processed so that each observation is one day for one patient, with their state on that given day.
Additionally, some cleaning/grouping of categorical variables needs to be performed before diving into the analysis. Here are the first few rows to get an idea of what the data looks like.
paged_table(head(data))
There are some categorical variables in which certain categories have a very small number of patients. Additionally, some categories are very similar in nature. In these instances, similar and small categories can be grouped in ways that are useful for modeling without losing too much important information.
There are currently 5 categories for location of discharge.
table(data$locationof_d_chome0snf1ipr2ltac3home_pt4) %>%
kbl(col.names = c("Location of Discharge", "Frequency"), align = 'c') %>% kable_minimal
| Location of Discharge | Frequency |
|---|---|
| 0 | 211 |
| 1 | 2 |
| 2 | 87 |
| 3 | 3 |
| 4 | 115 |
As you can see above, home (0) and home with PT (4) are very similar and hard to distinguish, so these can be grouped. The other 3 categories (SNF, IPR, and LTAC) all have smaller numbers of patients and are similar types of facilities that a patient could end up in after being discharged from the hospital, so these can be grouped into a 'long-term care' category.
data <- data %>% mutate(location_dc = case_when(
locationof_d_chome0snf1ipr2ltac3home_pt4 == 0 ~ 0,
locationof_d_chome0snf1ipr2ltac3home_pt4 == 1 ~ 1,
locationof_d_chome0snf1ipr2ltac3home_pt4 == 2 ~ 1,
locationof_d_chome0snf1ipr2ltac3home_pt4 == 3 ~ 1,
locationof_d_chome0snf1ipr2ltac3home_pt4 == 4 ~ 0
))
table(data$location_dc) %>%
kbl(col.names = c("New Location of Discharge", "Frequency"), align = 'c') %>% kable_minimal
| New Location of Discharge | Frequency |
|---|---|
| 0 | 326 |
| 1 | 92 |
Now location_dc is a new variable with the following groups:
0. Home (including home with PT)
1. Long-term care (SNF, IPR, or LTAC)
There is not a big difference between the different types of biliary complications, and some of the categories have very small numbers, as shown in the table below.
table(data$biliarycomplications_21bileleak2biloma3stricture4other) %>%
kbl(col.names = c("Biliary Complications", "Frequency"), align = 'c') %>% kable_minimal
| Biliary Complications | Frequency |
|---|---|
| 1 | 43 |
| 2 | 5 |
| 3 | 73 |
| 4 | 15 |
Currently the values for this variable are:
1. Bile leak
2. Biloma
3. Stricture
4. Other

with NA indicating no biliary complication. This can be transformed into a yes/no variable for whether the patient had any biliary complication at all.
data <- data %>% mutate(biliarycomplication = case_when(
is.na(biliarycomplications_21bileleak2biloma3stricture4other) ~ 0,
biliarycomplications_21bileleak2biloma3stricture4other > 0 ~ 1
))
table(data$biliarycomplication) %>%
kbl(col.names = c("New Biliary Complications", "Frequency"), align = 'c') %>% kable_minimal
| New Biliary Complications | Frequency |
|---|---|
| 0 | 286 |
| 1 | 136 |
Now biliarycomplication is a new variable with the following groups:
0. No
1. Yes
There are quite a few variables that indicate whether a patient had a condition (yes or no) before their liver transplant and again whether they had the condition after their transplant. I do not have any information on what pre and post mean in this context, such as when these diagnoses happened. Therefore, what is more interesting is whether a patient has had a condition at all. These 12 binary variables can be transformed into 6 indicator variables of whether the patient has had the condition at some point, regardless of whether it was pre or post liver transplant surgery. This is also helpful for modeling, as some of these indicator variables have low counts and grouping them resolves this issue.
data <- data %>% mutate(copd = case_when(
(copd_pre == 0 & copd_post == 0) ~ 0,
(copd_pre == 1 & copd_post == 1) ~ 1,
(copd_pre == 0 & copd_post == 1) ~ 1,
(copd_pre == 1 & copd_post == 0) ~ 1
),
htn = case_when(
(htn_pre == 0 & htn_post == 0) ~ 0,
(htn_pre == 1 & htn_post == 1) ~ 1,
(htn_pre == 0 & htn_post == 1) ~ 1,
(htn_pre == 1 & htn_post == 0) ~ 1
),
dm = case_when(
(dm_pre == 0 & dm_post == 0) ~ 0,
(dm_pre == 1 & dm_post == 1) ~ 1,
(dm_pre == 0 & dm_post == 1) ~ 1,
(dm_pre == 1 & dm_post == 0) ~ 1
),
cad = case_when(
(cad_pre == 0 & cad_post == 0) ~ 0,
(cad_pre == 1 & cad_post == 1) ~ 1,
(cad_pre == 0 & cad_post == 1) ~ 1,
(cad_pre == 1 & cad_post == 0) ~ 1
),
ckd = case_when(
(ckd_pre == 0 & ckd_post == 0) ~ 0,
(ckd_pre == 1 & ckd_post == 1) ~ 1,
(ckd_pre == 0 & ckd_post == 1) ~ 1,
(ckd_pre == 1 & ckd_post == 0) ~ 1
),
ctd = case_when(
(ctd_pre == 0 & ctd_post == 0) ~ 0,
(ctd_pre == 1 & ctd_post == 1) ~ 1,
(ctd_pre == 0 & ctd_post == 1) ~ 1,
(ctd_pre == 1 & ctd_post == 0) ~ 1
)
)
Here for each of these new variables:
0. No condition pre or post surgery
1. Condition either or both pre and post surgery
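The six case_when blocks above all encode the same rule: the new indicator is 1 if either the _pre or _post flag is 1. A more compact equivalent in base R, shown here on a small hypothetical data frame (and assuming the paired columns are coded strictly 0/1 with no missing values), could look like:

```r
# toy data frame with hypothetical values for two of the condition pairs
toy <- data.frame(copd_pre = c(0, 1, 0), copd_post = c(0, 0, 1),
                  htn_pre  = c(1, 0, 0), htn_post  = c(1, 0, 0))

# 'ever had the condition' indicator: 1 if the pre or the post flag is 1
for (v in c("copd", "htn")) {
  toy[[v]] <- as.integer(toy[[paste0(v, "_pre")]] == 1 |
                         toy[[paste0(v, "_post")]] == 1)
}

toy$copd  # 0 1 1
toy$htn   # 1 0 0
```

In the real data the loop would run over all six condition names, building the `_pre`/`_post` column names with paste0.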
data %>% select(copd, htn, dm, cad, ckd, ctd) %>% head() %>% kbl() %>% kable_minimal
| copd | htn | dm | cad | ckd | ctd |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 1 | 0 |
The states that patients can transition through post liver transplant, from best to worst, are home or inpatient long-term facility, hospital, ICU and/or ventilator, and death.
The typical order a patient will go through post liver transplant is: in the ICU and/or on a ventilator, in the hospital, at home or at an inpatient long-term facility, and then finally death. This is the most common progression over time; however, not all patients follow this trend: some skip states and some never reach certain states.
Ideally, this analysis would be improved by distinguishing between a patient being at home and being at a long-term facility; unfortunately, this level of information was not provided accurately in the data.
Currently, the data has the number of days in each of these states (or until these states), but does not show patients moving through the states over time. It needs to be processed to obtain variables for state and time. Here we can see the data provided with days in each state.
data %>% select(iculos, index_los, days_until_death, dayson_ventpost_lt) %>% head()
To create the outcome states, there must be no missing observations for any of the state variables.
data %>% select(iculos, dayson_ventpost_lt,index_los) %>% gg_miss_var(show_pct=TRUE)
Here we can see there is one missing observation for iculos and one for dayson_ventpost_lt. These were imputed using multiple imputation with predictive mean matching and 5 imputations.
set.seed(5225)
a <- aregImpute(~age + height + weight + ascites_yn + he + race + sex + smoker + cit +
etiology + meld + acr + newmalignancy + surgeryduration +
readmissionwithin90d + infection_90 + need_repeat_surgery +
vascularcomplicationswithin90days + sma00 + vat00 + sat00 + s_mhu00 + v_fhu00 +
sa_thu00 + biliarycomplication + index_los + days_until_death + iculos + location_dc
+ I(dayson_ventpost_lt) + I(htn) + I(dm) + I(ckd) + copd + cad + ctd,
data = data, n.impute = 5, x = TRUE)
imputed_data <- as.data.frame(a[['x']]) %>% select(iculos, dayson_ventpost_lt)
#original data with imputed values
data <- cbind(data %>% select(-iculos, -dayson_ventpost_lt), imputed_data)
All patients start in the ICU and/or on a ventilator post liver transplant. The maximum of these two values is used to determine the days in the ICU/Ventilator state. Then patients transition to the regular hospital ward for the remainder of index_los days (total length of hospital stay).
data$icu_vent <- pmax(data$dayson_ventpost_lt, data$iculos)
data['days_hospital'] <- ifelse(data$index_los - data$icu_vent >= 0,
data$index_los - data$icu_vent, 0)
There is no current indicator for days at home or days in a rehab facility. As there is no readmission data or any further dates of changes, it must be assumed that once a patient is discharged to home or an inpatient facility, the remaining days are spent at home or at an inpatient facility (not yet accounting for deaths). Days at home are only created up to the longest hospital stay in the data set plus 30 days (95 days in total). Although there is death information for up to 10 years for some patients, the analysis will not go beyond a few months. Additionally, there is no information on the cause of death; therefore, the further post liver transplant a patient dies, the less is known about how much the liver transplant itself factors into the patient's death.
#find total days in hospital
data <- data %>% mutate(total_days = icu_vent + days_hospital)
#find the maximum number of days of hospital stay in data
days <- data %>% select(total_days)
max_day <- max(days[!is.na(days),]) + 30
#create home/facility days
data <- data %>% mutate(days_home_facility = max_day - total_days)
Now that the number of days each patient was in each state has been created, a row for each day for each patient can be derived with the corresponding state for that day.
The following code creates a function to replicate as many rows as there are days for a given state.
days_per_state <- function(df, column){
  # empty data frame with an extra 'state' column
  new_df <- data.frame(matrix(ncol = length(df) + 1, nrow = 0))
  colnames(new_df) <- c(colnames(df), 'state')
  for (i in 1:nrow(df)){
    num_rows <- df[i, column]
    if (num_rows > 0){
      # replicate the patient's row once per day spent in this state
      for (j in 1:num_rows){
        d <- df[i, ]
        d['state'] <- column
        new_df <- rbind(new_df, d)
      }
    }
  }
  return(new_df)
}
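Growing a data frame with rbind inside a double loop can be slow for larger data. The same row replication can be sketched in one vectorized step by repeating row indices with rep() (assuming the count column has no missing values):

```r
# replicate each row of df as many times as the value in `column`,
# tagging all copies with the state name
days_per_state_fast <- function(df, column) {
  counts <- df[[column]]
  out <- df[rep(seq_len(nrow(df)), times = counts), , drop = FALSE]
  out$state <- column
  out
}

# small hypothetical example: counts 2, 0, 1 give 3 output rows
toy <- data.frame(id = 1:3, icu_vent = c(2, 0, 1))
nrow(days_per_state_fast(toy, "icu_vent"))  # 3
```

The second row, with a count of 0, is dropped entirely, matching the behavior of the loop above.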
The next code chunk runs the new function on each of the days-in-state columns to create a row for each patient for each observation day over the whole time period.
processed_data <- data.frame(matrix(ncol = length(data) + 1, nrow = 0))
colnames(processed_data) <- c(colnames(data),'state')
columns <- c('icu_vent', 'days_hospital','days_home_facility')
for (column in columns){
new_data <- days_per_state(data, column)
processed_data <- rbind(processed_data, new_data)
}
A variable to indicate time (in days) also needs to be added to the data. For this variable each observation is one day and there should be the same number of days for each patient (95 days in total).
processed_data <- ddply(processed_data, .(patient_mrn), transform,
time = seq_along(patient_mrn)-1)
The only state that has not yet been accounted for is death. If a patient dies during the time period, their state needs to be changed to death on the day they died and then all states after that should be converted to death because death is an absorbing state.
patients_died <- processed_data %>% filter(days_until_death <= max_day) %>%
select(patient_mrn) %>% unique()
for (patient in patients_died$patient_mrn){
df_d <- processed_data %>% filter(patient_mrn == patient)
death_date <- as.numeric(df_d$days_until_death[1])
processed_data[(processed_data$patient_mrn == patient),][death_date:max_day,'state'] <- 'dead'
}
The states can be encoded as numbers for ease of use throughout the rest of the analysis:
processed_data <- processed_data %>% mutate(state = case_when(
state == "days_home_facility" ~ 1,
state == "days_hospital" ~ 2,
state == "icu_vent" ~ 3,
state == "dead" ~ 4
))
Now that the state outcome variable has been created, it can be used to create the previous state. This variable is important in a first order Markov state transition model in order to condition the predicted outcome on the previous state.
processed_data <- processed_data %>%
mutate(prev_state = ifelse(time==0,0, lag(state)))
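The ifelse/lag construction above relies on the rows being ordered by patient and day, with time == 0 marking each patient's first row. An equivalent, arguably more explicit, version groups by patient before lagging; here is a sketch using dplyr on a small hypothetical data frame:

```r
library(dplyr)

# hypothetical mini data set: two patients, states coded 1-4
toy <- data.frame(patient_mrn = c(1, 1, 1, 2, 2),
                  state       = c(3, 3, 2, 3, 1))

toy <- toy %>%
  group_by(patient_mrn) %>%
  mutate(prev_state = lag(state, default = 0)) %>%  # 0 = no previous state
  ungroup()

toy$prev_state  # 0 3 3 0 3
```

Grouping guarantees the lag never crosses a patient boundary, even if the rows are later reordered.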
processed_data <- processed_data %>% select(-days_until_death, -index_los, -iculos,
-days_hospital, -days_home_facility, -total_days,
-icu_vent, -dayson_ventpost_lt)
processed_data$state <- factor(processed_data$state,
levels = c(1,2,3,4),
labels = c("Home/IPR", "Hospital", "ICU/Vent", "Dead"))
processed_data$prev_state <- factor(processed_data$prev_state,
levels = c(1,2,3,4),
labels = c("Home/IPR", "Hospital", "ICU/Vent", "Dead"))
Here we can see a table of the state and previous state variables.
table("State" = processed_data$state, "Previous State" = processed_data$prev_state) %>% kbl(align = 'c') %>% kable_minimal
| State / Previous State | Home/IPR | Hospital | ICU/Vent | Dead |
|---|---|---|---|---|
| Home/IPR | 33717 | 402 | 16 | 0 |
| Hospital | 0 | 3122 | 404 | 0 |
| ICU/Vent | 0 | 0 | 1483 | 0 |
| Dead | 5 | 2 | 2 | 515 |
There are many variables that can be removed immediately: intermediary variables used to clean other variables, variables excluded based on domain knowledge or data quality, and variables that are themselves outcomes.
These variables include longitudinal body composition measures from annual check-ups, bmi (calculated from height and weight, which are already in the data), etc.
predictors <- processed_data %>% select(age, height, weight, ascites_yn,
he, race, sex, smoker, cit, etiology, meld, acr, surgeryduration,
sma00, smi00, vat00, sat00, s_mhu00, v_fhu00, sa_thu00,
copd, htn, dm, cad, ckd, ctd, time, prev_state)
Here is a list of the remaining variables and their data types.
str(predictors)
## 'data.frame': 40090 obs. of 28 variables:
## $ age : int 57 57 57 57 57 57 57 57 57 57 ...
## $ height : num 163 163 163 163 163 ...
## $ weight : num 80.2 80.2 80.2 80.2 80.2 80.2 80.2 80.2 80.2 80.2 ...
## $ ascites_yn : int 1 1 1 1 1 1 1 1 1 1 ...
## $ he : int 1 1 1 1 1 1 1 1 1 1 ...
## $ race : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sex : int 1 1 1 1 1 1 1 1 1 1 ...
## $ smoker : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cit : num 9.07 9.07 9.07 9.07 9.07 9.07 9.07 9.07 9.07 9.07 ...
## $ etiology : int 4 4 4 4 4 4 4 4 4 4 ...
## $ meld : int 27 27 27 27 27 27 27 27 27 27 ...
## $ acr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ surgeryduration: int 355 355 355 355 355 355 355 355 355 355 ...
## $ sma00 : num 138 138 138 138 138 ...
## $ smi00 : num 52.1 52.1 52.1 52.1 52.1 ...
## $ vat00 : num 182 182 182 182 182 ...
## $ sat00 : num 289 289 289 289 289 ...
## $ s_mhu00 : num 14.1 14.1 14.1 14.1 14.1 ...
## $ v_fhu00 : num -86.2 -86.2 -86.2 -86.2 -86.2 ...
## $ sa_thu00 : num -85.5 -85.5 -85.5 -85.5 -85.5 ...
## $ copd : num 0 0 0 0 0 0 0 0 0 0 ...
## $ htn : num 1 1 1 1 1 1 1 1 1 1 ...
## $ dm : num 1 1 1 1 1 1 1 1 1 1 ...
## $ cad : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ckd : num 1 1 1 1 1 1 1 1 1 1 ...
## $ ctd : num 0 0 0 0 0 0 0 0 0 0 ...
## $ time : num 0 1 2 3 4 5 6 7 8 9 ...
## $ prev_state : Factor w/ 4 levels "Home/IPR","Hospital",..: NA 3 3 3 3 3 3 3 2 2 ...
Some of these possible predictors need to be converted to factors using the code below.
convert <- c('ascites_yn', 'he', 'race', 'sex', 'smoker', 'etiology', 'acr', 'copd',
'htn', 'dm', 'cad', 'ckd', 'ctd', 'prev_state')
for (var in convert){
predictors[var] <- as.factor(predictors[[var]])
}
Finally, we can get a better understanding of these possible predictors by looking at summary statistics and distributions for each one.
html(describe(predictors))
**age**

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40090 | 0 | 48 | 0.999 | 55.68 | 10.69 | 36 | 43 | 51 | 57 | 63 | 67 | 68 |

**height**

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40090 | 0 | 63 | 0.997 | 172.1 | 11.72 | 154.9 | 157.5 | 165.1 | 172.7 | 180.3 | 185.4 | 188.0 |

**weight**

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40090 | 0 | 347 | 1 | 86.82 | 23.73 | 56.40 | 60.78 | 71.00 | 85.00 | 98.88 | 114.39 | 124.51 |

**ascites_yn**

| n | missing | distinct |
|---|---|---|
| 40090 | 0 | 2 |

| Value | 0 | 1 |
|---|---|---|
| Frequency | 7695 | 32395 |
| Proportion | 0.192 | 0.808 |

**he**

| n | missing | distinct |
|---|---|---|
| 40090 | 0 | 2 |

| Value | 0 | 1 |
|---|---|---|
| Frequency | 9310 | 30780 |
| Proportion | 0.232 | 0.768 |

**race**

| n | missing | distinct |
|---|---|---|
| 40090 | 0 | 2 |

| Value | 0 | 1 |
|---|---|---|
| Frequency | 36860 | 3230 |
| Proportion | 0.919 | 0.081 |

**sex**

| n | missing | distinct |
|---|---|---|
| 40090 | 0 | 2 |

| Value | 0 | 1 |
|---|---|---|
| Frequency | 25270 | 14820 |
| Proportion | 0.63 | 0.37 |

**smoker**

| n | missing | distinct |
|---|---|---|
| 40090 | 0 | 2 |

| Value | 0 | 1 |
|---|---|---|
| Frequency | 35625 | 4465 |
| Proportion | 0.889 | 0.111 |

**cit**

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39995 | 95 | 300 | 1 | 5.741 | 2.622 | 2.48 | 3.12 | 3.98 | 5.38 | 7.10 | 8.88 | 9.88 |

**etiology**

| n | missing | distinct |
|---|---|---|
| 40090 | 0 | 5 |

| Value | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Frequency | 10925 | 9880 | 11970 | 3800 | 3515 |
| Proportion | 0.273 | 0.246 | 0.299 | 0.095 | 0.088 |

**meld**

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39995 | 95 | 41 | 0.999 | 22.14 | 10.43 | 8 | 10 | 15 | 22 | 28 | 34 | 38 |

**acr**

| n | missing | distinct |
|---|---|---|
| 40090 | 0 | 2 |

| Value | 0 | 1 |
|---|---|---|
| Frequency | 36670 | 3420 |
| Proportion | 0.915 | 0.085 |

**surgeryduration**

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39995 | 95 | 204 | 1 | 312.9 | 78.5 | 217 | 233 | 258 | 305 | 353 | 400 | 456 |

**sma00**

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39900 | 190 | 420 | 1 | 164.2 | 44.18 | 108.2 | 117.6 | 135.0 | 159.5 | 189.6 | 212.5 | 231.4 |

**smi00**

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38285 | 1805 | 403 | 1 | 55.47 | 13.14 | 40.11 | 42.70 | 47.13 | 53.63 | 62.20 | 70.46 | 77.18 |

lowest : 20.72918 28.46148 29.14429 29.93326 33.68552
highest: 93.29691 94.08640 97.15067 98.00957 111.65846

**vat00**

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39805 | 285 | 419 | 1 | 115.2 | 89.39 | 14.06 | 25.08 | 53.44 | 98.32 | 158.58 | 220.54 | 268.63 |

**sat00**

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39805 | 285 | 419 | 1 | 232.9 | 143.4 | 48.29 | 80.15 | 135.11 | 211.87 | 322.35 | 407.34 | 469.77 |

**s_mhu00**

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39805 | 285 | 419 | 1 | 29.31 | 10.55 | 14.18 | 16.92 | 22.44 | 29.51 | 35.38 | 41.96 | 45.23 |

**v_fhu00**

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39805 | 285 | 419 | 1 | -79.59 | 9.381 | -95.84 | -90.72 | -85.14 | -78.45 | -73.68 | -69.72 | -67.91 |

lowest : -107.1590 -105.7500 -104.2460 -103.8350 -102.6440
highest: -62.7276 -62.1905 -62.1478 -61.3434 -59.4222

**sa_thu00**

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39805 | 285 | 419 | 1 | -82.99 | 16.28 | -103.63 | -100.24 | -94.16 | -84.92 | -73.43 | -62.50 | -55.56 |

lowest : -114.3210 -111.9750 -109.0650 -108.3390 -108.2930
highest: -49.4904 -49.3095 -48.6547 -46.9401 -45.9087

**copd**

| n | missing | distinct |
|---|---|---|
| 39995 | 95 | 2 |

| Value | 0 | 1 |
|---|---|---|
| Frequency | 37145 | 2850 |
| Proportion | 0.929 | 0.071 |

**htn**

| n | missing | distinct |
|---|---|---|
| 39995 | 95 | 2 |

| Value | 0 | 1 |
|---|---|---|
| Frequency | 23370 | 16625 |
| Proportion | 0.584 | 0.416 |

**dm**

| n | missing | distinct |
|---|---|---|
| 39995 | 95 | 2 |

| Value | 0 | 1 |
|---|---|---|
| Frequency | 25270 | 14725 |
| Proportion | 0.632 | 0.368 |

**cad**

| n | missing | distinct |
|---|---|---|
| 39995 | 95 | 2 |

| Value | 0 | 1 |
|---|---|---|
| Frequency | 28690 | 11305 |
| Proportion | 0.717 | 0.283 |

**ckd**

| n | missing | distinct |
|---|---|---|
| 39995 | 95 | 2 |

| Value | 0 | 1 |
|---|---|---|
| Frequency | 26220 | 13775 |
| Proportion | 0.656 | 0.344 |

**ctd**

| n | missing | distinct |
|---|---|---|
| 40090 | 0 | 2 |

| Value | 0 | 1 |
|---|---|---|
| Frequency | 39615 | 475 |
| Proportion | 0.988 | 0.012 |

**time**

| n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 40090 | 0 | 95 | 1 | 47 | 31.66 | 4 | 9 | 23 | 47 | 71 | 85 | 90 |

**prev_state**

| n | missing | distinct |
|---|---|---|
| 39668 | 422 | 4 |

| Value | Home/IPR | Hospital | ICU/Vent | Dead |
|---|---|---|---|---|
| Frequency | 33722 | 3526 | 1905 | 515 |
| Proportion | 0.850 | 0.089 | 0.048 | 0.013 |
The first method for variable selection is to look at the percentage of missing values in each of the possible features. Imputation can be used, but if any variables are missing a very large portion of data, they might not be useful in a model.
gg_miss_var(predictors, show_pct=TRUE)
Luckily, there are not very many observations missing for any of these variables. The variable missing the most is smi00, but it is only missing ~4.5% of observations, which can easily be imputed.
Now we can see if any variables are redundant, meaning they can easily be explained by other variables. If a variable is easily explained by the other possible predictors, it would be redundant to use it in a model, especially when there are many possible predictors to use for modeling.
r <- redun(as.formula(paste0("~ ",paste0(colnames(predictors),collapse = "+"))), type = 'adjusted', data=predictors, r2 = 0.8)
r['rsq1'] %>% kbl(col.names = "Adjusted $R^2$")
With an adjusted \(R^2\) of 0.99, the variance in height is easily explained by the other variables. This actually makes sense due to the fact that \(smi00 = sma00 / height^2\) (with height in meters).
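As a quick sanity check of that relationship, the rounded first-row values from the str() output earlier (sma00 ≈ 138 cm², height ≈ 163 cm, smi00 ≈ 52.1) can be plugged in:

```r
# verify smi00 = sma00 / height^2 using the first row's rounded values
sma00  <- 138        # skeletal muscle area, cm^2
height <- 163 / 100  # height converted from cm to m
sma00 / height^2     # approximately 51.9, close to the reported 52.1
```

The small discrepancy comes from the rounding in the printed values.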
redun(~ smi00 + height + sma00, type = 'adjusted', data=predictors, r2 = 0.8)
##
## Redundancy Analysis
##
## redun(formula = ~smi00 + height + sma00, data = predictors, r2 = 0.8,
## type = "adjusted")
##
## n: 38285 p: 3 nk: 3
##
## Number of NAs: 1805
## Frequencies of Missing Values Due to Each Variable
## smi00 height sma00
## 1805 0 190
##
##
## Transformation of target variables forced to be linear
##
## R-squared cutoff: 0.8 Type: adjusted
##
## R^2 with which each variable can be predicted from all other variables:
##
## smi00 height sma00
## 0.989 0.992 0.989
##
## Rendundant variables:
##
## height
##
## Predicted from variables:
##
## smi00 sma00
##
## Variable Deleted R^2 R^2 after later deletions
## 1 height 0.992
It’s better to keep the original measurements in the model rather than derived measurements. Because smi00 (skeletal muscle mass index) is based on height and sma00, it can be removed.
predictors <- predictors %>% select(-smi00)
Now we can perform the redundancy analysis again to determine whether any other variables are redundant with smi00 removed.
r2 <- redun(as.formula(paste0("~ ",paste0(colnames(predictors),collapse = "+"))), type = 'adjusted', data=predictors, r2 = .8)
r2['rsq1'] %>% kbl(col.names = "Adjusted $R^2$")
There are still some body composition variables with fairly high adjusted \(R^2\) values, meaning much of their variance can be explained by the other variables, but because these are important variables for this analysis, they are kept as predictors in the model.
Variable clustering can also be used to see if there are any strong clusters within the variables of interest.
#clustering using hoeffding D
vc <- varclus(as.formula(paste0("~ ",paste0(colnames(predictors),collapse = "+"))),
data = predictors, sim = 'hoeffding')
#clustering using spearman
vc2 <- varclus(as.formula(paste0("~ ",paste0(colnames(predictors),collapse = "+"))),
data = predictors, sim = 'spearman')
#joining the two cluster analyses
par(mfrow=c(2,1))
plot(vc)
plot(vc2)
Both methods of clustering, Hoeffding D and Spearman \(\rho^2\), produce similar results. v_fhu00 (visceral adipose tissue density) and sa_thu00 (subcutaneous adipose tissue density) cluster together with a high correlation; however, as these are both important body composition variables in this analysis, they will both be kept as predictors for modeling. They also cluster with vat00 (visceral adipose tissue). weight and sat00 also cluster together, which makes sense as sat00 is a body fat measure that could be related to weight. weight can be removed, as all the body composition variables measured from CT scans are important to keep and weight is highly related to at least some of those measures.
predictors <- predictors %>% select(-weight)
height also seems to be highly correlated with sex. This makes sense, as males are usually taller than females. Because height can be age-dependent in the elderly, it's good to check whether that is the case within this set of patients.
d <- datadist(predictors)
options(datadist="d")
f <- orm(height ~ rcs(age,4)*sex, data = predictors)
qu <- Quantile(f); med <- function(x) qu(.5, x)
ggplot(Predict(f, age, sex, fun=med, conf.int=FALSE), ylab='Predicted Median Height, cm')
For patients younger than 40 there seems to be some odd behavior, but the predicted median is fairly flat for all patients above 40, suggesting height does not decline with age in this set of patients. At this stage in the process, it makes sense to include both height and sex.
The final variables being included as possible predictors for modeling are:
age, height, ascites_yn, he, race, sex, smoker, cit, etiology, meld, acr, surgeryduration, sma00, vat00, sat00, s_mhu00, v_fhu00, sa_thu00, copd, htn, dm, cad, ckd, ctd, time, prev_state
#adding outcome variable to data set
data <- cbind(predictors,state = processed_data$state, patient_mrn = processed_data$patient_mrn)
#filtering out day 0 with no previous state
data <- data %>% filter(!is.na(prev_state))
Even though this is a pretty large data set, it only contains 422 patients, so it's important to keep every single patient to train a model. Therefore, missing values need to be imputed.
gg_miss_var(data, show_pct=TRUE) + labs(title = "Percent of Missing Data")
As shown during variable selection and in the plot above, there is only a small percentage of missing values (less than 1% for any variable), which is really promising. Multiple imputation using predictive mean matching with 5 imputations is used to impute missing values. All variables, including the outcome variable, are used to impute all other variables in order to get the best imputation results.
set.seed(578)
data$state <- factor(data$state,
levels = c("Home/IPR", "Hospital", "ICU/Vent", "Dead"))
data$prev_state <- factor(data$prev_state,
levels = c("Home/IPR", "Hospital", "ICU/Vent", "Dead"))
a <- aregImpute(~ age + height + ascites_yn + he + race + sex + smoker + cit +
etiology + meld + acr + surgeryduration + sma00 + vat00 + sat00 +
s_mhu00 + v_fhu00 + sa_thu00 + htn + dm + ckd +
copd + cad + ctd + prev_state + time + state + patient_mrn,
data = data, n.impute = 5, x = TRUE)
data <- as.data.frame(a[['x']])
convert <- c('ascites_yn', 'he', 'race', 'sex', 'smoker', 'etiology', 'acr', 'copd',
'htn', 'dm', 'cad', 'ckd', 'ctd')
for (var in convert){
data[var] <- factor(data[[var]])
}
data$state <- factor(data$state,
levels = c(1,2,3,4),
labels = c("Home/IPR", "Hospital", "ICU/Vent", "Dead"))
data$prev_state <- factor(data$prev_state,
levels = c(1,2,3,4),
labels = c("Home/IPR", "Hospital", "ICU/Vent", "Dead"))
Here is an event chart for a random sample of patients, removing all days post death if that state occurs in the sample. This gives a flavor of what states patients go through over time and shows trends in the data. As you can see, all patients start in the ICU and most end at home or in an inpatient long-term facility. The black square at the bottom of the plot indicates that a patient died on day 9; all of that patient's remaining observations are removed, as death is an absorbing state.
set.seed(218)
ssamp <- sample(unique(data$patient_mrn), 30, FALSE)
dr <- subset(data, patient_mrn %in% ssamp)
dr <- subset(dr, prev_state != 'Dead')
dr <- subset(dr, time <= 61)
dr$id <- as.integer(as.factor(dr$patient_mrn))
dr$status <- factor(dr$state, levels=rev(levels(dr$state)))
dr$time <- dr$time - 1
datad <- datadist(dr)
options(datadist="datad")
multEventChart(state ~ time + id, data=dr,
absorb='Dead', sortbylast = FALSE) +
theme_classic() +
theme(legend.position='bottom') +
labs(title = "State Transitions for 60 Days Post Liver Transplant", y = "Days") +
scale_y_continuous(
breaks = seq(0,60,5))
The plot below shows a summary of successive state transitions over the first 30 days after liver transplant. All patients start in the ICU on day 0. As you can see, many patients move out of the ICU and into a different ward in the hospital as early as day 1. By the end of this time period, the most common transition is from hospital to home/IPR, which makes sense as patients are discharged after recovering from their liver transplant.
d <- data %>% filter(time < 31)
propsTrans(state ~ time + patient_mrn, data=d, maxsize=6) +
theme(legend.position='bottom', axis.text.x=element_text(angle=90, hjust=1))
The proportional odds model relies on the assumption of proportional odds between different states. To see whether the proportional odds assumption is reasonable for this data, we can look at some plots and check for parallelism between states.
Here is a plot of the proportion of each state over time. As mentioned before, on day one (the day after transplant surgery) most patients are still in the ICU, with only a few in the hospital (not ICU). The small pink band at the top shows deaths. This area will always be increasing, as death is an absorbing state. By day 65, all patients have either been discharged to home or a facility or have passed away.
propsPO(state ~ time, data=data, nrow=1) +
labs(title = "Proportion of Patients in Each State Over Time", x = "Time") +
theme(legend.position='bottom', axis.text.x=element_text(angle=90, hjust=1))
To understand whether the proportional odds assumption holds, we can overlay the observed proportions on the expected proportions computed as if the proportional odds assumption were satisfied. This was computed using time as the sole predictor, modeled as a flexible spline. The odds ratios are computed relative to day 1 and applied to the observed proportions on day 1.
datad <- datadist(data)
options(datadist="datad")
f <- lrm(state ~ rcs(time, 5), data=data)
g <- Function(f) # derive predicted log odds as a function of day
# Make another function that gets the OR vs. day=1
dor <- function(time) exp(g(time) - g(1))
propsPO(state ~ time, odds.ratio=dor, data=data) +
theme(legend.position='bottom', axis.text.x=element_text(angle=90, hjust=1))
With respect to time, the proportional odds assumption does not seem to hold. This was done carrying death forward. Although there are very few deaths in the subset of the first 34 days in this plot, as death is an absorbing state, it makes sense to terminate follow-up after death.
Here is the same plot as above but with no observations for a patient after they die.
dd <- subset(data, prev_state != 'Dead')
f2 <- orm(state ~ rcs(time, 5), data=dd)
g2 <- Function(f2) # derive predicted log odds as a function of day
# Make another function that gets the OR vs. day=1
dor <- function(time) exp(g2(time) - g2(1))
propsPO(state ~ time, odds.ratio=dor, data=dd) +
theme(legend.position='bottom', axis.text.x=element_text(angle=90, hjust=1))
After terminating follow-up after death, the proportional odds assumption seems far from true. This makes sense as all patients begin on day 0 in the ICU post liver transplant and therefore state is highly dependent on time.
Although the proportional odds model relies on the PO assumption and this data set does not meet it, a PO model can still produce useful results. See here for details on why it can be acceptable to use the proportional odds model when the PO assumption is not met.
As death is absorbing, there is no value in predicting state once a patient has already died; doing so would only bias the predictions of states up to the day of death. For this reason, all observations after a patient has died are removed for the remainder of this analysis.
#truncate after death
data <- subset(data, prev_state != 'Dead')
data$prev_state <- factor(data$prev_state,
levels = c("Home/IPR", "Hospital", "ICU/Vent"))
The following formula from Section 4.4 of Regression Modeling Strategies was used to calculate the effective sample size and understand how many parameters a model can have without having issues of overfitting.
\[ \begin{aligned} n - \frac{1}{n^2}\sum_{i=1}^{k}n_{i}^3 \end{aligned} \]
table(data$state) %>% kbl(align = 'c') %>% kable_minimal
| State | Frequency |
|---|---|
| Home/IPR | 34135 |
| Hospital | 3526 |
| ICU/Vent | 1483 |
| Dead | 9 |
\[ \begin{aligned} 39153 - \frac{1}{39153^2}\sum_{i=1}^{4}n_{i}^3 = 39153 - \frac{1}{39153^2}(34135^3 + 3526^3 + 1483^3 + 9^3) \approx 13176 \end{aligned} \]
The effective sample size is roughly a third of the raw number of observations, but it is still very large and gives a lot of flexibility for modeling. For example, using the 15:1 rule of thumb as a very basic guideline, the model could have \(13176/15 \approx 878\) parameters and still satisfy the rule. This allows for a very flexible model with splines and interactions without concern about overfitting.
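As a quick arithmetic check, the formula can be evaluated directly. A small Python sketch using the state frequencies from the table above:

```python
# Effective sample size for an ordinal outcome (RMS section 4.4):
# n - (1/n^2) * sum of n_i^3 over the outcome categories.
freqs = [34135, 3526, 1483, 9]   # Home/IPR, Hospital, ICU/Vent, Dead
n = sum(freqs)

effective_n = n - sum(f ** 3 for f in freqs) / n ** 2
```

Dividing `effective_n` by 15 gives a rough cap on the number of model parameters under the 15:1 rule of thumb.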
To understand the relative importance of predictors — so that degrees of freedom can be allocated to give the most important predictors the most flexibility — a model can be fit with the variance–covariance matrix adjusted for within-patient correlation using the robust cluster sandwich covariance estimator. Continuous covariates are modeled with restricted cubic splines with 5 knots to allow nonlinear effects.
data$patient_mrn <- as.character(data$patient_mrn)
f <- orm(state ~ ctd + sex + rcs(age,5) + rcs(sma00,5) + rcs(height,5) + ascites_yn + he +
race + smoker + rcs(cit,5) + etiology + rcs(meld,5) + acr +
rcs(surgeryduration,5) + rcs(vat00,5) + rcs(sat00,5) +
rcs(s_mhu00,5) + rcs(v_fhu00,5) + rcs(sa_thu00,5) +
htn + dm + ckd + rcs(time,5) + prev_state, data=data, x=TRUE, y=TRUE, maxit = 30)
g <- robcov(f, data$patient_mrn)
plot(anova(g), trans=sqrt)
prev_state, meld, time and surgeryduration have the highest \(\chi^2\) values, so these variables can be allocated more degrees of freedom, specifically more spline knots for the continuous ones. There is flexibility to use splines on many predictors, as the sample size is large enough to support the extra parameters.
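For intuition about what robcov is doing, here is a minimal sketch of a cluster sandwich variance estimator in Python, using simulated toy data and OLS in place of the ordinal model; summing score contributions within each cluster before forming the "meat" of the sandwich is what makes the standard errors robust to within-patient correlation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy longitudinal data: 50 "patients" with 10 identical repeated rows each,
# so within-patient correlation is extreme and naive SEs are far too small.
n_clusters, per = 50, 10
cluster = np.repeat(np.arange(n_clusters), per)
x = rng.normal(size=n_clusters)[cluster]
y = 1.0 + 2.0 * x + rng.normal(size=n_clusters)[cluster]
X = np.column_stack([np.ones_like(x), x])

beta = np.linalg.solve(X.T @ X, X.T @ y)    # OLS fit
resid = y - X @ beta
bread = np.linalg.inv(X.T @ X)

# Naive (model-based) variance, ignoring clustering
sigma2 = resid @ resid / (len(y) - 2)
se_naive = np.sqrt(np.diag(sigma2 * bread))

# Sandwich "meat": sum score vectors within each cluster first
meat = np.zeros((2, 2))
for c in range(n_clusters):
    rows = cluster == c
    s = X[rows].T @ resid[rows]
    meat += np.outer(s, s)

se_cluster = np.sqrt(np.diag(bread @ meat @ bread))
```

With rows nearly duplicated within patients, the cluster-robust standard error comes out substantially larger than the naive one, mirroring why robcov is needed for these repeated daily observations.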
Here is a model built to predict state for each day up to 90 days post liver transplant, terminating follow-up after death. In addition to splines that relax the linearity assumption, interaction terms can be added to the model. Based on domain knowledge, age and sex could both interact with body composition. Because sma00 (skeletal muscle area) showed some correlation with sex in the variable clustering above, an interaction term between these two variables is added. Age can also interact with sma00, as skeletal muscle area can change differently with age.
The model is a PO model using the cluster sandwich covariance estimator to adjust the variance-covariance matrix for within-patient correlation.
m <- orm(state ~ ctd + (sex + rcs(age,5) + rcs(sma00,4))^2 - rcs(age,5):sex +
rcs(height,3) + ascites_yn + he +race + smoker + rcs(cit,3) + etiology +
rcs(meld,5) + acr + rcs(surgeryduration,5) + rcs(vat00,4) + rcs(sat00,3) +
rcs(s_mhu00,3) + rcs(v_fhu00,3) + rcs(sa_thu00,4) +
htn + dm + ckd + rcs(time,5) + prev_state,
data=data, x=TRUE, y=TRUE, maxit = 30)
g <- robcov(m, data$patient_mrn)
print(g)
Logistic (Proportional Odds) Ordinal Regression Model
orm(formula = state ~ ctd + (sex + rcs(age, 5) + rcs(sma00, 4))^2 -
rcs(age, 5):sex + rcs(height, 3) + ascites_yn + he + race +
smoker + rcs(cit, 3) + etiology + rcs(meld, 5) + acr + rcs(surgeryduration,
5) + rcs(vat00, 4) + rcs(sat00, 3) + rcs(s_mhu00, 3) + rcs(v_fhu00,
3) + rcs(sa_thu00, 4) + htn + dm + ckd + rcs(time, 5) + prev_state,
data = data, x = TRUE, y = TRUE, maxit = 30)
Frequencies of Responses
Home/IPR Hospital ICU/Vent Dead
34135 3526 1483 9
| | Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes |
|---|---|---|---|
| Obs 39153 | LR χ2 31342.15 | R2 0.913 | ρ 0.576 |
| Distinct Y 4 | d.f. 66 | g 4.224 | |
| Cluster on data$patient_mrn | Pr(>χ2) <0.0001 | gr 68.282 | |
| Clusters 422 | Score χ2 45993.60 | \|Pr(Y ≥ median)-½\| 0.489 | |
| Y0.5 1 | Pr(>χ2) <0.0001 | | |
| max \|∂log L/∂β\| 2×10⁻⁵ | | | |
| | β | S.E. | Wald Z | Pr(>\|Z\|) |
|---|---|---|---|---|
| y≥Hospital | -2.7582 | 6.7917 | -0.41 | 0.6847 |
| y≥ICU/Vent | -10.0741 | 6.7857 | -1.48 | 0.1376 |
| y≥Dead | -16.9129 | 6.7853 | -2.49 | 0.0127 |
| ctd=2 | 0.0885 | 0.3752 | 0.24 | 0.8135 |
| sex=2 | 1.3584 | 1.1304 | 1.20 | 0.2295 |
| age | -0.1654 | 0.1488 | -1.11 | 0.2664 |
| age' | 1.0730 | 0.4378 | 2.45 | 0.0142 |
| age'' | -22.9066 | 7.4872 | -3.06 | 0.0022 |
| age''' | 52.9194 | 16.5439 | 3.20 | 0.0014 |
| sma00 | -0.0247 | 0.0450 | -0.55 | 0.5825 |
| sma00' | 0.0034 | 0.1497 | 0.02 | 0.9818 |
| sma00'' | -0.0211 | 0.4576 | -0.05 | 0.9633 |
| height | -0.0070 | 0.0086 | -0.81 | 0.4204 |
| height' | 0.0045 | 0.0091 | 0.49 | 0.6233 |
| ascites_yn=2 | -0.0717 | 0.1214 | -0.59 | 0.5550 |
| he=2 | 0.1558 | 0.0896 | 1.74 | 0.0822 |
| race=2 | -0.1439 | 0.1277 | -1.13 | 0.2598 |
| smoker=2 | -0.2583 | 0.1125 | -2.30 | 0.0216 |
| cit | -0.0528 | 0.0428 | -1.23 | 0.2175 |
| cit' | 0.0446 | 0.0534 | 0.84 | 0.4037 |
| etiology=2 | 0.2410 | 0.1122 | 2.15 | 0.0317 |
| etiology=3 | 0.2149 | 0.1144 | 1.88 | 0.0604 |
| etiology=4 | 0.1508 | 0.1375 | 1.10 | 0.2728 |
| etiology=5 | -0.1083 | 0.1297 | -0.84 | 0.4035 |
| meld | 0.0869 | 0.0272 | 3.19 | 0.0014 |
| meld' | -0.2768 | 0.1518 | -1.82 | 0.0683 |
| meld'' | 1.0900 | 0.5937 | 1.84 | 0.0664 |
| meld''' | -1.5743 | 0.8682 | -1.81 | 0.0698 |
| acr=2 | 0.1596 | 0.1135 | 1.41 | 0.1599 |
| surgeryduration | -0.0008 | 0.0031 | -0.27 | 0.7862 |
| surgeryduration' | -0.0317 | 0.0315 | -1.01 | 0.3132 |
| surgeryduration'' | 0.1333 | 0.0999 | 1.33 | 0.1823 |
| surgeryduration''' | -0.1741 | 0.1129 | -1.54 | 0.1231 |
| vat00 | 0.0006 | 0.0034 | 0.18 | 0.8571 |
| vat00' | -0.0007 | 0.0168 | -0.04 | 0.9644 |
| vat00'' | 0.0024 | 0.0359 | 0.07 | 0.9466 |
| sat00 | -0.0006 | 0.0010 | -0.61 | 0.5449 |
| sat00' | 0.0012 | 0.0011 | 1.11 | 0.2651 |
| s_mhu00 | -0.0103 | 0.0088 | -1.17 | 0.2429 |
| s_mhu00' | 0.0069 | 0.0101 | 0.68 | 0.4950 |
| v_fhu00 | -0.0097 | 0.0121 | -0.81 | 0.4201 |
| v_fhu00' | 0.0024 | 0.0140 | 0.17 | 0.8621 |
| sa_thu00 | 0.0198 | 0.0153 | 1.29 | 0.1955 |
| sa_thu00' | -0.0325 | 0.0586 | -0.56 | 0.5788 |
| sa_thu00'' | 0.0720 | 0.1341 | 0.54 | 0.5915 |
| htn=2 | 0.0767 | 0.0749 | 1.02 | 0.3057 |
| dm=2 | 0.1927 | 0.0893 | 2.16 | 0.0311 |
| ckd=2 | 0.2477 | 0.0856 | 2.89 | 0.0038 |
| time | 0.0007 | 0.0094 | 0.07 | 0.9407 |
| time' | 0.0839 | 0.0839 | 1.00 | 0.3179 |
| time'' | -0.6894 | 0.3443 | -2.00 | 0.0453 |
| time''' | 1.6983 | 0.6257 | 2.71 | 0.0066 |
| prev_state=Hospital | 10.2493 | 0.4679 | 21.91 | <0.0001 |
| prev_state=ICU/Vent | 16.8385 | 0.6320 | 26.64 | <0.0001 |
| sex=2 × sma00 | -0.0126 | 0.0091 | -1.39 | 0.1644 |
| sex=2 × sma00' | 0.0769 | 0.0362 | 2.12 | 0.0337 |
| sex=2 × sma00'' | -0.2512 | 0.1161 | -2.16 | 0.0304 |
| age × sma00 | 0.0011 | 0.0011 | 0.98 | 0.3295 |
| age' × sma00 | -0.0082 | 0.0034 | -2.41 | 0.0160 |
| age'' × sma00 | 0.1810 | 0.0595 | 3.04 | 0.0024 |
| age''' × sma00 | -0.4241 | 0.1329 | -3.19 | 0.0014 |
| age × sma00' | -0.0024 | 0.0037 | -0.64 | 0.5203 |
| age' × sma00' | 0.0254 | 0.0112 | 2.27 | 0.0229 |
| age'' × sma00' | -0.5934 | 0.2066 | -2.87 | 0.0041 |
| age''' × sma00' | 1.4898 | 0.4968 | 3.00 | 0.0027 |
| age × sma00'' | 0.0070 | 0.0113 | 0.62 | 0.5330 |
| age' × sma00'' | -0.0707 | 0.0317 | -2.23 | 0.0258 |
| age'' × sma00'' | 1.6853 | 0.6019 | 2.80 | 0.0051 |
| age''' × sma00'' | -4.3450 | 1.5053 | -2.89 | 0.0039 |
anova(g) %>% kbl(align = 'c') %>% kable_minimal
| | Chi-Square | d.f. | P |
|---|---|---|---|
| ctd | 0.0556328 | 1 | 0.8135366 |
| sex (Factor+Higher Order Factors) | 6.2147866 | 4 | 0.1836718 |
| All Interactions | 6.0870577 | 3 | 0.1074506 |
| age (Factor+Higher Order Factors) | 21.9121324 | 16 | 0.1460514 |
| All Interactions | 19.6799976 | 12 | 0.0733867 |
| Nonlinear (Factor+Higher Order Factors) | 17.3754001 | 12 | 0.1360133 |
| sma00 (Factor+Higher Order Factors) | 38.5155286 | 18 | 0.0033084 |
| All Interactions | 27.6508203 | 15 | 0.0238598 |
| Nonlinear (Factor+Higher Order Factors) | 17.7467359 | 12 | 0.1235997 |
| height | 0.7462041 | 2 | 0.6885949 |
| Nonlinear | 0.2412018 | 1 | 0.6233394 |
| ascites_yn | 0.3484406 | 1 | 0.5549972 |
| he | 3.0213216 | 1 | 0.0821765 |
| race | 1.2699451 | 1 | 0.2597770 |
| smoker | 5.2751429 | 1 | 0.0216320 |
| cit | 2.1911074 | 2 | 0.3343544 |
| Nonlinear | 0.6973596 | 1 | 0.4036723 |
| etiology | 10.9696688 | 4 | 0.0269070 |
| meld | 76.3896162 | 4 | 0.0000000 |
| Nonlinear | 4.8877410 | 3 | 0.1802043 |
| acr | 1.9749246 | 1 | 0.1599261 |
| surgeryduration | 12.1701093 | 4 | 0.0161301 |
| Nonlinear | 10.0172735 | 3 | 0.0184199 |
| vat00 | 0.6702529 | 3 | 0.8801774 |
| Nonlinear | 0.0429306 | 2 | 0.9787634 |
| sat00 | 2.8046040 | 2 | 0.2460299 |
| Nonlinear | 1.2420024 | 1 | 0.2650855 |
| s_mhu00 | 1.6187036 | 2 | 0.4451465 |
| Nonlinear | 0.4656927 | 1 | 0.4949754 |
| v_fhu00 | 1.0336717 | 2 | 0.5964047 |
| Nonlinear | 0.0301790 | 1 | 0.8620847 |
| sa_thu00 | 5.6777162 | 3 | 0.1283873 |
| Nonlinear | 0.3153140 | 2 | 0.8541427 |
| htn | 1.0491994 | 1 | 0.3056915 |
| dm | 4.6496299 | 1 | 0.0310602 |
| ckd | 8.3737018 | 1 | 0.0038069 |
| time | 14.8060627 | 4 | 0.0051208 |
| Nonlinear | 12.4742381 | 3 | 0.0059232 |
| prev_state | 779.7482188 | 2 | 0.0000000 |
| sex * sma00 (Factor+Higher Order Factors) | 6.0870577 | 3 | 0.1074506 |
| Nonlinear | 4.6917607 | 2 | 0.0957629 |
| Nonlinear Interaction : f(A,B) vs. AB | 4.6917607 | 2 | 0.0957629 |
| age * sma00 (Factor+Higher Order Factors) | 19.6799976 | 12 | 0.0733867 |
| Nonlinear | 15.7527076 | 11 | 0.1505627 |
| Nonlinear Interaction : f(A,B) vs. AB | 15.7527076 | 11 | 0.1505627 |
| f(A,B) vs. Af(B) + Bg(A) | 10.1378681 | 6 | 0.1189649 |
| Nonlinear Interaction in age vs. Af(B) | 13.4354105 | 9 | 0.1438736 |
| Nonlinear Interaction in sma00 vs. Bg(A) | 10.5941753 | 8 | 0.2257705 |
| TOTAL NONLINEAR | 75.9768483 | 36 | 0.0001117 |
| TOTAL INTERACTION | 27.6508203 | 15 | 0.0238598 |
| TOTAL NONLINEAR + INTERACTION | 77.8454151 | 38 | 0.0001472 |
| TOTAL | 1235.5124448 | 66 | 0.0000000 |
plot(anova(g), sort = c("ascending"))
prev_state is by far the strongest predictor of state. This makes sense: over time a patient’s status stabilizes and changes less, so it becomes increasingly likely that the previous day’s state matches the current state. The next strongest predictors are meld, sma00 and time.
It is fairly easy to understand each parameter and gauge how important each predictor is, but the overall model is fairly complex, with 66 degrees of freedom and many predictors modeled with splines.
Fast backward selection was performed to see whether any predictors should be removed from the model.
print(fastbw(g, rule=c('p')), estimates=FALSE)
##
## Deleted Chi-Sq d.f. P Residual d.f. P AIC
## vat00 0.67 3 0.8802 0.67 3 0.8802 -5.33
## ctd 0.06 1 0.8055 0.73 4 0.9475 -7.27
## height 0.53 2 0.7688 1.26 6 0.9740 -10.74
## ascites_yn 0.18 1 0.6713 1.44 7 0.9844 -12.56
## htn 0.82 1 0.3665 2.25 8 0.9723 -13.75
## sma00 3.60 3 0.3075 5.86 11 0.8828 -16.14
## s_mhu00 2.67 2 0.2629 8.53 13 0.8076 -17.47
## race 1.00 1 0.3184 9.52 14 0.7961 -18.48
## cit 2.02 2 0.3646 11.54 16 0.7749 -20.46
## sex 0.93 1 0.3337 12.48 17 0.7705 -21.52
## sex * sma00 4.31 3 0.2301 16.78 20 0.6669 -23.22
## he 1.85 1 0.1738 18.63 21 0.6086 -23.37
## sat00 4.56 2 0.1021 23.20 23 0.4492 -22.80
## acr 3.93 1 0.0474 27.13 24 0.2985 -20.87
## time 9.95 4 0.0414 37.07 28 0.1172 -18.93
##
## Factors in Final Model
##
## [1] age smoker etiology meld
## [5] surgeryduration v_fhu00 sa_thu00 dm
## [9] ckd prev_state age * sma00
Many variables were removed in this process. sex was removed, but given previous studies showing its importance it is kept in the model. Due to the importance of body composition variables to this analysis, all body composition variables were kept as well. However, the number of knots for some of these variables can be reduced, since they are less important; this makes the model less complex.
The variables removed based on the fast backward selection and domain knowledge are ascites_yn, ctd, htn, cit, race, acr and height. A new model was fit with these predictors removed.
fin <- orm(state ~ (sex + rcs(age,3) + rcs(sma00,3))^2 - rcs(age,3):sex +
he + etiology + rcs(meld,5) + rcs(surgeryduration,3) + rcs(vat00,3) +
rcs(sat00,3) + rcs(s_mhu00,3) + rcs(v_fhu00,3) + rcs(sa_thu00,3) +
dm + ckd + rcs(time,5) + prev_state,
data=data, x=TRUE, y=TRUE, maxit = 30)
final <- robcov(fin, data$patient_mrn)
print(final)
Logistic (Proportional Odds) Ordinal Regression Model
orm(formula = state ~ (sex + rcs(age, 3) + rcs(sma00, 3))^2 -
rcs(age, 3):sex + he + etiology + rcs(meld, 5) + rcs(surgeryduration,
3) + rcs(vat00, 3) + rcs(sat00, 3) + rcs(s_mhu00, 3) + rcs(v_fhu00,
3) + rcs(sa_thu00, 3) + dm + ckd + rcs(time, 5) + prev_state,
data = data, x = TRUE, y = TRUE, maxit = 30)
Frequencies of Responses
Home/IPR Hospital ICU/Vent Dead
34135 3526 1483 9
| | Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes |
|---|---|---|---|
| Obs 39153 | LR χ2 31314.66 | R2 0.913 | ρ 0.576 |
| Distinct Y 4 | d.f. 40 | g 4.173 | |
| Cluster on data$patient_mrn | Pr(>χ2) <0.0001 | gr 64.885 | |
| Clusters 422 | Score χ2 45978.58 | \|Pr(Y ≥ median)-½\| 0.489 | |
| Y0.5 1 | Pr(>χ2) <0.0001 | | |
| max \|∂log L/∂β\| 2×10⁻⁵ | | | |
| | β | S.E. | Wald Z | Pr(>\|Z\|) |
|---|---|---|---|---|
| y≥Hospital | -8.0451 | 3.3677 | -2.39 | 0.0169 |
| y≥ICU/Vent | -15.3360 | 3.3775 | -4.54 | <0.0001 |
| y≥Dead | -22.1453 | 3.3857 | -6.54 | <0.0001 |
| sex=2 | -0.3043 | 0.7264 | -0.42 | 0.6753 |
| age | 0.0085 | 0.0616 | 0.14 | 0.8907 |
| age' | -0.0782 | 0.0709 | -1.10 | 0.2703 |
| sma00 | -0.0005 | 0.0212 | -0.02 | 0.9829 |
| sma00' | -0.0205 | 0.0240 | -0.85 | 0.3935 |
| he=2 | 0.1546 | 0.0911 | 1.70 | 0.0897 |
| etiology=2 | 0.2102 | 0.1091 | 1.93 | 0.0540 |
| etiology=3 | 0.1894 | 0.1230 | 1.54 | 0.1236 |
| etiology=4 | 0.1987 | 0.1447 | 1.37 | 0.1699 |
| etiology=5 | -0.0597 | 0.1139 | -0.52 | 0.6002 |
| meld | 0.0872 | 0.0259 | 3.37 | 0.0008 |
| meld' | -0.2959 | 0.1570 | -1.88 | 0.0594 |
| meld'' | 1.1717 | 0.6187 | 1.89 | 0.0582 |
| meld''' | -1.7115 | 0.8964 | -1.91 | 0.0562 |
| surgeryduration | -0.0034 | 0.0014 | -2.41 | 0.0160 |
| surgeryduration' | 0.0047 | 0.0014 | 3.28 | 0.0010 |
| vat00 | -0.0002 | 0.0020 | -0.09 | 0.9296 |
| vat00' | 0.0010 | 0.0026 | 0.37 | 0.7088 |
| sat00 | -0.0004 | 0.0010 | -0.36 | 0.7192 |
| sat00' | 0.0007 | 0.0011 | 0.66 | 0.5083 |
| s_mhu00 | -0.0076 | 0.0083 | -0.91 | 0.3621 |
| s_mhu00' | 0.0042 | 0.0100 | 0.42 | 0.6730 |
| v_fhu00 | -0.0057 | 0.0137 | -0.42 | 0.6771 |
| v_fhu00' | -0.0051 | 0.0146 | -0.35 | 0.7280 |
| sa_thu00 | 0.0106 | 0.0092 | 1.16 | 0.2478 |
| sa_thu00' | 0.0011 | 0.0116 | 0.09 | 0.9260 |
| dm=2 | 0.1950 | 0.0862 | 2.26 | 0.0238 |
| ckd=2 | 0.2232 | 0.0814 | 2.74 | 0.0061 |
| time | 0.0093 | 0.0097 | 0.96 | 0.3386 |
| time' | 0.0533 | 0.0821 | 0.65 | 0.5157 |
| time'' | -0.6073 | 0.3362 | -1.81 | 0.0708 |
| time''' | 1.6001 | 0.6199 | 2.58 | 0.0098 |
| prev_state=Hospital | 10.3355 | 0.4697 | 22.01 | <0.0001 |
| prev_state=ICU/Vent | 16.9626 | 0.6413 | 26.45 | <0.0001 |
| sex=2 × sma00 | 0.0019 | 0.0052 | 0.36 | 0.7176 |
| sex=2 × sma00' | 0.0016 | 0.0086 | 0.18 | 0.8575 |
| age × sma00 | -0.0002 | 0.0004 | -0.38 | 0.7012 |
| age' × sma00 | 0.0007 | 0.0005 | 1.37 | 0.1695 |
| age × sma00' | 0.0005 | 0.0005 | 1.00 | 0.3168 |
| age' × sma00' | -0.0009 | 0.0006 | -1.34 | 0.1789 |
This model has a very high \(R^2\) of 0.913. For an ordinal model this is a pseudo-\(R^2\) (a likelihood-based index rather than literal variance explained), but it still indicates a very strong fit. The model also has a \(\rho\) of 0.576, a rank-discrimination measure of how well the model differentiates the levels of state. The overall LR \(\chi^2\) is also very high at 31,314.66, which shows that at least one predictor in the model is strongly associated with the outcome state.
Now that I have chosen my final model, I want to validate that model to check for overfitting using bootstrap validation. Because I performed fast backward selection, the selection process must be repeated within each bootstrap resample; otherwise the validation would be too optimistic, and standard errors and confidence intervals would be too small.
#could only get the code to work on subset of data with 65 days post liver transplant instead of
# the full 90 days used in the rest of the analysis
sub_data = subset(data, time < 66)
sub <- orm(state ~ (sex + rcs(age,3) + rcs(sma00,3))^2 - rcs(age,3):sex +
he + etiology + rcs(meld,5) + rcs(surgeryduration,3) + rcs(vat00,3) +
rcs(sat00,3) + rcs(s_mhu00,3) + rcs(v_fhu00,3) + rcs(sa_thu00,3) +
dm + ckd + rcs(time,5) + prev_state,
data=sub_data, x=TRUE, y=TRUE, maxit = 30)
subset <- robcov(sub, sub_data$patient_mrn)
set.seed(1993)
v <- validate(subset, B=100, estimates=FALSE, bw=TRUE, rule='p', tol=1e-17, maxit=40)
##
## Backwards Step-down - Original Model
##
## Deleted Chi-Sq d.f. P Residual d.f. P AIC
## vat00 0.42 2 0.8123 0.42 2 0.8123 -3.58
## sex * sma00 0.95 2 0.6216 1.37 4 0.8499 -6.63
## sex 0.03 1 0.8733 1.39 5 0.9252 -8.61
## s_mhu00 1.13 2 0.5697 2.52 7 0.9258 -11.48
## sat00 2.01 2 0.3668 4.52 9 0.8737 -13.48
## sma00 2.65 2 0.2664 7.17 11 0.7852 -14.83
## age 3.79 2 0.1504 10.96 13 0.6143 -15.04
## etiology 7.92 4 0.0946 18.88 17 0.3356 -15.12
## he 2.71 1 0.0999 21.58 18 0.2510 -14.42
## v_fhu00 5.64 2 0.0595 27.23 20 0.1290 -12.77
## dm 3.15 1 0.0760 30.37 21 0.0847 -11.63
##
## Factors in Final Model
##
## [1] meld surgeryduration sa_thu00 ckd
## [5] time prev_state age * sma00
v
## index.orig training test optimism index.corrected n
## rho 0.6687 0.6688 0.6685 0.0002 0.6684 100
## R2 0.9138 0.9147 0.9133 0.0013 0.9124 100
## Slope 1.0000 1.0000 0.9779 0.0221 0.9779 100
## g 4.9219 5.0664 4.9443 0.1221 4.7998 100
## pdm 0.4844 0.4845 0.4843 0.0001 0.4842 100
##
## Factors Retained in Backwards Elimination
##
## sex age sma00 he etiology meld surgeryduration vat00 sat00 s_mhu00 v_fhu00
## * * *
## * *
## * * * *
## * *
## * * * *
## * * *
## * * * *
## * * * *
## * * *
## * * * *
## * * *
## * * *
## * * * * *
## * * * *
## *
## * * * * * * *
## * * * * *
## * * *
## * * * * *
## * * *
## * * * * *
## * * *
## * * *
## *
## * * * *
## * * * *
## * * *
## * * * * * *
## * *
## * * *
## * * * *
## * * * *
## * *
## * * * * *
## * * *
## * * *
## * * *
## * * *
## * *
## * * *
## * * * * *
## * * *
## * * * *
## * *
## * * * *
## * * * *
## * *
## * * * *
## * *
## * * * * *
## * * * * *
## * * * * *
## * * * *
## * * * * * *
## * * *
## * *
## * * * * * *
## * * * *
## * *
## * *
## * * * *
## * *
## * * *
## * * * * *
## * * * * * *
## * *
## * * * *
## * * * *
## * * *
## * * * * * *
## * * *
## * * * *
## * * * *
## * * *
## * * *
## * * * * *
## * * * *
## * * * *
## * * * *
## * * *
## * *
## * * * *
## * *
## * * * *
## * *
## * * *
## *
## * * * * *
## * * *
## * * * *
## * * *
## * * * *
## * *
## * * *
## * * *
## * * *
## * *
## * *
## * * * * *
## * * * * *
## sa_thu00 dm ckd time prev_state sex * sma00 age * sma00
## * * * *
## * * * * *
## * * *
## * * * * *
## * * * * *
## * * *
## * * * * *
## * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * *
## * * * *
## * * * *
## * * * * *
## * * * *
## * * * * * *
## * * * *
## * * * * * *
## * * * *
## * * * *
## * * * *
## * * * *
## * * * * * *
## * * * * * *
## * * * * *
## * * * *
## * * * * * *
## * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * *
## * * * * * *
## * * * *
## * * * *
## * * * *
## * * * * *
## * * * *
## * * * * *
## * * * * * *
## * * *
## * * * * *
## * * * *
## * * * *
## * * *
## * * * * *
## * * * *
## * * * * *
## * * * * * *
## * * * *
## * * * * *
## * * * * *
## * * * * *
## * * * * * *
## * * * *
## * * * * *
## * * * *
## * * * * *
## * * * *
## * * * *
## * * * * * *
## * * *
## * * * * * *
## * * * *
## * * * * * * *
## * * *
## * * * * *
## * * * * * *
## * * * * *
## * * * *
## * * * * *
## * * * *
## * * * * *
## * * *
## * * * *
## * * * * *
## * * *
## * * * *
## * * *
## * * * * * *
## * * * * * *
## * * * * * *
## * * * *
## * * *
## * * * * *
## * * *
## * * * *
## * * *
## * * * *
## * * *
## * * * *
## * * * *
## * * * *
## * * *
## * * * * *
## * *
## * * * *
##
## Frequencies of Numbers of Factors Retained
##
## 4 5 6 7 8 9 10 11 12
## 1 4 11 29 20 18 7 8 2
#couldn't figure out how to get a nice html version to print
#html(v)
Based on the low optimism for each index and the small drops in the corrected \(R^2\), \(\rho\) and slope, this model validates well and shows little overfitting. As with backward selection, the validation suggests that many predictors could be removed; however, given the low risk of overfitting, it is reasonable to keep them in the final model in order to understand the association and importance of all predictors.
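The optimism calculation that validate() performs can be sketched in miniature. The Python toy below (simulated data, an OLS \(R^2\) instead of the ordinal-model indexes, and simple row resampling rather than resampling whole patients as one should for clustered data) shows the fit-on-bootstrap, test-on-original recipe behind the optimism column:

```python
import numpy as np

rng = np.random.default_rng(42)

def ols_fit(X, y):
    """Coefficients of an OLS fit of y on X (intercept prepended)."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

def r2(X, y, beta):
    Xd = np.column_stack([np.ones(len(X)), X])
    resid = y - Xd @ beta
    return 1 - resid.var() / y.var()

# Simulated data where only the first of 10 predictors matters,
# so the apparent R^2 is somewhat optimistic.
n, p = 200, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)

beta_full = ols_fit(X, y)
apparent = r2(X, y, beta_full)

# Bootstrap optimism: (performance on the resample) minus (performance of
# that refit model on the original data), averaged over resamples.
B = 100
optimism = 0.0
for _ in range(B):
    idx = rng.integers(0, n, n)
    b = ols_fit(X[idx], y[idx])
    optimism += (r2(X[idx], y[idx], b) - r2(X, y, b)) / B

corrected = apparent - optimism
```

The corrected index is the apparent index minus the average optimism, exactly the "index.corrected" column in the validate() output.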
an2 <- anova(final)
an2 %>% kbl(align = 'c') %>% kable_minimal
| | Chi-Square | d.f. | P |
|---|---|---|---|
| sex (Factor+Higher Order Factors) | 0.7494969 | 3 | 0.8615045 |
| All Interactions | 0.7490620 | 2 | 0.6876117 |
| age (Factor+Higher Order Factors) | 10.4353440 | 6 | 0.1074757 |
| All Interactions | 9.5183081 | 4 | 0.0493724 |
| Nonlinear (Factor+Higher Order Factors) | 4.8486817 | 3 | 0.1832187 |
| sma00 (Factor+Higher Order Factors) | 22.1511640 | 8 | 0.0046434 |
| All Interactions | 10.0150199 | 6 | 0.1240209 |
| Nonlinear (Factor+Higher Order Factors) | 2.0438385 | 4 | 0.7276959 |
| he | 2.8803800 | 1 | 0.0896649 |
| etiology | 8.4868835 | 4 | 0.0752858 |
| meld | 72.0974262 | 4 | 0.0000000 |
| Nonlinear | 7.4690505 | 3 | 0.0583590 |
| surgeryduration | 14.4345159 | 2 | 0.0007338 |
| Nonlinear | 10.7830563 | 1 | 0.0010243 |
| vat00 | 0.4059105 | 2 | 0.8163148 |
| Nonlinear | 0.1394480 | 1 | 0.7088304 |
| sat00 | 1.0408100 | 2 | 0.5942798 |
| Nonlinear | 0.4374906 | 1 | 0.5083362 |
| s_mhu00 | 1.1059662 | 2 | 0.5752313 |
| Nonlinear | 0.1781522 | 1 | 0.6729661 |
| v_fhu00 | 1.2185864 | 2 | 0.5437351 |
| Nonlinear | 0.1209744 | 1 | 0.7279801 |
| sa_thu00 | 4.7485704 | 2 | 0.0930810 |
| Nonlinear | 0.0086290 | 1 | 0.9259891 |
| dm | 5.1118621 | 1 | 0.0237628 |
| ckd | 7.5139638 | 1 | 0.0061222 |
| time | 13.9236378 | 4 | 0.0075428 |
| Nonlinear | 13.6332339 | 3 | 0.0034494 |
| prev_state | 758.6366151 | 2 | 0.0000000 |
| sex * sma00 (Factor+Higher Order Factors) | 0.7490620 | 2 | 0.6876117 |
| Nonlinear | 0.0322579 | 1 | 0.8574629 |
| Nonlinear Interaction : f(A,B) vs. AB | 0.0322579 | 1 | 0.8574629 |
| age * sma00 (Factor+Higher Order Factors) | 9.5183081 | 4 | 0.0493724 |
| Nonlinear | 2.1017528 | 3 | 0.5515582 |
| Nonlinear Interaction : f(A,B) vs. AB | 2.1017528 | 3 | 0.5515582 |
| f(A,B) vs. Af(B) + Bg(A) | 1.8064724 | 1 | 0.1789320 |
| Nonlinear Interaction in age vs. Af(B) | 1.9796534 | 2 | 0.3716411 |
| Nonlinear Interaction in sma00 vs. Bg(A) | 1.8979262 | 2 | 0.3871422 |
| TOTAL NONLINEAR | 42.8295340 | 18 | 0.0008458 |
| TOTAL INTERACTION | 10.0150199 | 6 | 0.1240209 |
| TOTAL NONLINEAR + INTERACTION | 46.8468179 | 20 | 0.0006160 |
| TOTAL | 1119.1294757 | 40 | 0.0000000 |
The previous state dominates the model with a \(\chi^2\) of 759. There is evidence of nonlinearity overall, and of nonlinearity combined with interaction, but not of interaction alone. The interaction between age and skeletal muscle area is significant at \(\alpha = 0.05\), but no other interaction terms are significant. Predictors with a nonlinear relationship with state include meld, surgeryduration and time.
There are many different body composition measures in this analysis. Since I want to understand how overall body composition affects state outcomes during the 90-day period post liver transplant, I performed a combined chunk test of all body composition variables in addition to examining each predictor individually.
b2 <- anova(final, sma00, vat00, sat00, s_mhu00, v_fhu00, sa_thu00)
b2 %>% kbl(align = 'c') %>% kable_minimal
| | Chi-Square | d.f. | P |
|---|---|---|---|
| sma00 (Factor+Higher Order Factors) | 22.1511640 | 8 | 0.0046434 |
| All Interactions | 10.0150199 | 6 | 0.1240209 |
| Nonlinear (Factor+Higher Order Factors) | 2.0438385 | 4 | 0.7276959 |
| vat00 | 0.4059105 | 2 | 0.8163148 |
| Nonlinear | 0.1394480 | 1 | 0.7088304 |
| sat00 | 1.0408100 | 2 | 0.5942798 |
| Nonlinear | 0.4374906 | 1 | 0.5083362 |
| s_mhu00 | 1.1059662 | 2 | 0.5752313 |
| Nonlinear | 0.1781522 | 1 | 0.6729661 |
| v_fhu00 | 1.2185864 | 2 | 0.5437351 |
| Nonlinear | 0.1209744 | 1 | 0.7279801 |
| sa_thu00 | 4.7485704 | 2 | 0.0930810 |
| Nonlinear | 0.0086290 | 1 | 0.9259891 |
| TOTAL NONLINEAR | 3.1342971 | 9 | 0.9587315 |
| TOTAL | 36.3458895 | 18 | 0.0063697 |
With a \(\chi^2\) of 36.3 and a p-value of 0.0064, there does seem to be an association between overall body composition and liver transplant outcome states. However, examining each body composition measure separately, skeletal muscle area (sma00) appears to be the only one associated with the outcome states.
Below we can see a plot of predictor importance taking into account nonlinearity and interactions as well as adding the measures for the overall body composition chunk test.
s2 <- rbind(an2, `body composition`=b2['TOTAL', ])
class(s2) <- 'anova.rms'
plot(s2, main = "Testing Statistical Importance", sort = c("ascending"))
Based on \(\alpha = 0.05\), the significant predictors are prev_state, MELD score (meld), skeletal muscle area (sma00), surgeryduration, time, chronic kidney disease (ckd), the age × sma00 interaction, and diabetes (dm).
As expected, the most important parameter in my model is prev_state. The combined chunk test for body composition has a \(\chi^2\) = 36.3 and a p-value of 0.0064.
The below plots show the partial effects of each of the predictors in the final model on a mean scale.
ggplot(Predict(final), vnames='names', sepdiscrete='vertical')
The partial effects plots show a clearly non-flat relationship only for time and prev_state, and very slightly for meld, which makes sense as these are three of the most important predictors in the model. Aside from prev_state and time beyond 50 days, the effect sizes appear very small: although many predictors show a significant association with outcome states, changes in those predictors shift the odds of the outcome states only negligibly.
Below are plots showing the baseline covariate effects from the model, transformed to the probability scale, for days 1, 15, 30, 45, 60, and 75, for some of the most significant predictors.
t <- c(1, 15, 30, 45, 60, 75)
ggplot(Predict(final, time = t, prev_state, fun=plogis))
Interestingly, over time the predicted probabilities for patients whose previous state was ICU/Vent or Home/IPR barely change; however, for patients previously in the hospital, the probability appears to increase slightly at first and then decrease after 45 days post liver transplant.
ggplot(Predict(final, time = t, meld, fun=plogis))
ggplot(Predict(final, time = t, sma00, fun=plogis))
ggplot(Predict(final, time = t, surgeryduration, fun=plogis))
All three of the continuous predictors above have wide confidence intervals and negligible differences over time. As MELD score increases, the predicted probability increases by a very small amount, and skeletal muscle area has the opposite effect. As surgery duration increases, the predicted probability decreases slightly and then begins to increase, but again with wide confidence intervals and a very small effect.
ggplot(Predict(final, time = t, ckd, fun=plogis))
ggplot(Predict(final, time = t, etiology, fun=plogis))
ggplot(Predict(final, time = t, dm, fun=plogis))
Confidence intervals seem to heavily overlap for all three categorical variables above without any interesting trends.
ggplot(Predict(final, time = t, age, fun=plogis))
Age also has an extremely small effect. For patients above 60 the predicted probability may increase slightly, but we can only say this with very low confidence and a small effect size.
Here is a plot of the interquartile-range odds ratios for continuous predictors and simple odds ratios for categorical predictors. The numbers next to each variable name on the left axis show the lower and upper quartiles for continuous variables and the compared versus reference group for categorical variables. Those ranges are on the original scale, while the plotted intervals are on the log odds ratio scale.
plot(summary(final), log=TRUE)
This shows similar results to the partial effects plots above, as the only odds ratios not hovering around 1 are prev_state and time meaning no other predictors really influence any changes in log odds.
The fitted values from the model can be converted from the odds scale to probabilities for ease of interpretation; any of the per-state probabilities can be examined this way.
Here is a plot of the probability of being at home or in an inpatient long term care facility averaging over all the patients for each day. The probability is extremely low at first as all patients start in the ICU post liver transplant and then it increases reaching practically 1 by around 50 days post transplant.
p <- predict(final, type='fitted.ind')
options(scipen = 50)
x <- as.data.frame(p)
y <- cbind(x, data)
prob_home <- y %>%
group_by(time) %>%
summarise(`state=Home/IPR` = mean(`state=Home/IPR`, na.rm = TRUE))
ggplot(prob_home, aes(x = time, y = `state=Home/IPR`)) + geom_line() +
labs(title = "Probability of Being at Home/IPR", x = "Day Post Liver Transplant",
y = "P(state = Home/IPR)")
Below is a plot of the mean, median and 90th percentile of state as skeletal muscle area is varied, with all other predictors held constant at their medians (continuous variables) or modes (categorical variables). The plot is deceptive: it appears that the predicted values decrease as skeletal muscle area increases, but the y-axis scale shows the decrease is less than a thousandth of a unit.
M <- Mean(final, codes=TRUE)
qu <- Quantile(final, codes=TRUE)
med <- function(x) qu(.5 , x)
p90 <- function(x) qu(.9 , x)
pmean <- Predict(final, sma00 , fun=M, conf.int = FALSE )
pmed <- Predict(final, sma00 , fun=med , conf.int = FALSE )
p90 <- Predict(final, sma00 , fun=p90 , conf.int = FALSE )
z <- rbind('orm mean '=pmean , 'orm median '=pmed , 'orm P90 '=p90 )
ggplot(z, groups ='.set.',
adj.subtitle =FALSE , legend.label = FALSE )
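Behind Mean() and Quantile() is a simple calculation: the PO model gives exceedance probabilities, which are differenced into cell probabilities and then summarized. A minimal Python sketch with made-up intercepts (illustrative values, not the fitted ones) shows the idea:

```python
import math

def expit(z):
    return 1 / (1 + math.exp(-z))

# Hypothetical PO intercepts for P(Y >= j), j = Hospital, ICU/Vent, Dead,
# with states coded 0..3; values are illustrative only.
alphas = [-2.0, -9.0, -16.0]

def state_cells(xb):
    """Cell probabilities P(Y = 0..3) from cumulative exceedance probs."""
    exceed = [expit(a + xb) for a in alphas]
    return [1 - exceed[0],
            exceed[0] - exceed[1],
            exceed[1] - exceed[2],
            exceed[2]]

def state_mean(xb):
    """Predicted mean of the coded ordinal state for linear predictor xb."""
    return sum(j * p for j, p in enumerate(state_cells(xb)))
```

Quantiles come from the same cell probabilities by accumulating them until the target probability is reached, which is what the med and p90 helpers above do for the fitted model.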
The plots below examine the ordinality of state with respect to each predictor by assessing how the mean of each predictor varies across the levels of state. A monotonic relationship means that ordinality holds. The solid lines connect stratified means, while the dotted lines connect the estimated expected values of \(X \mid Y = j\) given that the proportional odds assumption holds.
datad <- datadist(data)
options(datadist="datad")
plot.xmean.ordinaly(state ~ sex + age + sma00 + he + etiology + meld +
surgeryduration + vat00 + sat00 + s_mhu00 + v_fhu00 +
sa_thu00 + dm + ckd + time, data = data, cr=TRUE,
subn = FALSE, cex.points = 0.65)
Ordinality only seems to hold for surgeryduration, and the proportional odds assumption does not seem to hold for any of the predictors. It holds fairly well for time across the first three states but not for “Dead”. This is most likely because only a small number of patients died in the first 90 days post liver transplant; the means for death are all far from their estimated expected values, but with so little data in that category these means are hard to trust.
Below is an additional plot to assess whether the proportional odds assumption holds for each of the predictors. The plus sign, triangle and circle correspond to the cumulative logits at different levels of state. When the proportional odds assumption holds, the differences in logits between different X values should be the same across the cutoffs \(Y \geq 1\), \(Y \geq 2\), and \(Y \geq 3\).
state2 = as.integer(data$state) - 1
sf = function(y) {
c('Y>=1' = qlogis(mean(y >= 1)),
'Y>=2' = qlogis(mean(y >= 2)),
'Y>=3' = qlogis(mean(y >= 3)))}
s1 = summary(state2 ~ sex + age + sma00 + he + etiology + meld +
surgeryduration + vat00 + sat00 + s_mhu00 + v_fhu00 +
sa_thu00 + dm + ckd + time + prev_state, fun = sf, data)
plot(s1, which = 1:3, pch = 1:3, xlab = 'logit', vnames = 'names', main = '',
width.factor = 1.5, xlim=c(-10, 0))
From this plot we can see that the proportional odds assumption does not seem to hold for many of the variables. time and meld seem especially far off at some levels, while sex, etiology and dm are the only predictors that appear to reliably satisfy proportional odds.
Because patients’ states tend to stabilize with time, I am interested in running a simulation to understand whether the number of days post liver transplant included in the data set impacts the importance of predictors in the model, and possibly their effect sizes. I hypothesize that the effect size of time will decrease as fewer days are included, since states are less stable closer to the liver transplant.
I will run a simulation of 100 bootstrap samples, stratified by patient, for each of 8 different end time points and analyze the odds ratios and \(\chi^2\) values for each number of days used in the model.
Although this makes the model less flexible, to reduce the complexity of this analysis I used only linear predictors and no interactions for this simulation. Each predictor then has a single coefficient in the model, which makes it easier to see how that coefficient changes with different numbers of days included.
lin <- orm(state ~ sex + age + sma00 + he + etiology + meld + surgeryduration + vat00 +
sat00 + s_mhu00 + v_fhu00 + sa_thu00 + dm + ckd + time + prev_state,
data=data, x=TRUE, y=TRUE, maxit = 40, tol=1e-17)
linear <- robcov(lin, data$patient_mrn)
an_l <- anova(linear)
N <- 100 #number of bootstrap samples
t = seq(20, 90, 10) #different periods of time post liver transplant
#df to capture for coefficients
col_names_coef_l <- append(names(linear$coefficients), c('days', 'N'))
df_coef_l <- data.frame(matrix(ncol = length(col_names_coef_l), nrow = 0))
colnames(df_coef_l) <- col_names_coef_l
#df to capture for testing
b_l <- as.data.frame(an_l)
b_l$var <- row.names(b_l)
col_names_stat_l <- append(colnames(b_l), c('days', 'N'))
df_stat_l <- data.frame(matrix(ncol = length(col_names_stat_l), nrow = 0))
colnames(df_stat_l) <- col_names_stat_l
patients <- data$patient_mrn %>% unique()
for (i in 1:N){
#resampling by patient not observation
ps <- sample(patients, replace = TRUE)
s <- data.frame(matrix(ncol = length(colnames(data)), nrow = 0))
colnames(s) <- colnames(data)
for (p in ps){
d <- data %>% filter(patient_mrn == p)
s <- rbind(s, d)
}
for (j in t){
#subset to data up to time t
dat <- subset(s, time < j+1)
#fit model
    #fit the model on the subsetted bootstrap sample, not the full data
    l <- orm(state ~ sex + age + sma00 + he + etiology + meld + surgeryduration + vat00 +
               sat00 + s_mhu00 + v_fhu00 + sa_thu00 + dm + ckd + time + prev_state,
             data=dat, x=TRUE, y=TRUE, maxit = 40, tol=1e-17)
    l2 <- robcov(l, dat$patient_mrn)
#model coefficients
coef <- l2$coefficients
    #one-row data frame of coefficients, keeping names like "y>=IPR" intact
    c <- as.data.frame(as.list(coef), check.names = FALSE)
c$days <- j
c$N <- i
df_coef_l <- rbind(df_coef_l, c)
#stat testing
a <- as.data.frame(anova(l2))
a$var <- row.names(a)
a$days <- j
a$N <- i
df_stat_l <- rbind(df_stat_l, a)
}
}
saveRDS(df_coef_l, 'SimCoef_linear.rds', compress='xz')
saveRDS(df_stat_l, 'SimStat_linear.rds', compress='xz')
Below are plots showing, for each predictor, the mean coefficient across bootstrap samples, followed by plots of the chi-squared values, for models fit with different numbers of days post liver transplant included. The flat lines in all of the plots suggest that the importance and effect of the predictors do not change with the number of days post liver transplant included in the analysis.
#read in simulation results
df_coef_l <- readRDS('SimCoef_linear.rds')
df_stat_l <- readRDS('SimStat_linear.rds')
#find mean odds ratios for each parameter
t = seq(20, 90, 10)
df_mean_coef_l <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(df_mean_coef_l) <- c("mean", "var", "days")
for (i in t){
df_l <- df_coef_l %>% filter(days == i)
mean_df_l <- as.data.frame(colMeans(df_l))
colnames(mean_df_l) <- c("mean")
mean_df_l$var <- row.names(mean_df_l)
mean_df_l$days <- i
df_mean_coef_l <- rbind(df_mean_coef_l, mean_df_l)
}
#model terms only; drop the bookkeeping columns days and N
vars <- setdiff(mean_df_l$var %>% unique(), c("days", "N"))
for (v in vars){
g <- df_mean_coef_l %>% filter(var == v) %>% ggplot(aes(x = days, y = mean)) +
geom_point() + labs(y = "Mean Coefficient (log odds)", x = "Number of Days Included in Sample",
title = paste0("Examining ", v))
print(g)
}
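Rather than one plot per predictor, the same means can be scanned in a single faceted display. A sketch, assuming the `df_mean_coef_l` data frame built above:

```r
# One panel per model term; free y scales since coefficients differ in magnitude
ggplot(df_mean_coef_l, aes(x = days, y = mean)) +
  geom_point() +
  facet_wrap(~ var, scales = "free_y") +
  labs(x = "Number of Days Included in Sample", y = "Mean Coefficient")
```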
vars <- df_stat_l$var %>% unique()
for (v in vars){
g <- df_stat_l %>% filter(var == v) %>% ggplot(aes(x = days, y = `Chi-Square`)) + geom_point() +
labs(x = "Number of Days Included in Sample", title = paste0("Does Importance of ",v," Change?"))
print(g)
}